Use of a Large-scale Spontaneous Speech Corpus in the Study of Linguistic Variation

Authors

Kikuo Maekawa

Hanae Koiso

Hideaki Kikuchi

Kiyoko Yoneyama

Abstract

The Corpus of Spontaneous Japanese (CSJ) is a large-scale database of spontaneous Japanese. It contains speech signals and transcriptions of about 7 million words, along with annotations such as part-of-speech (POS) and phonetic labels. After describing the design issues of the CSJ, the paper evaluates its potential as a resource for the study of linguistic variation.

Similar resources

Analysis of Language Variation Using a Large-Scale Corpus of Spontaneous Speech

A large-scale corpus of spontaneous speech can be a powerful tool for the study of language variation. Moreover, given that the corpus is publicly available, corpus-based analysis could open up the possibility of follow-up analysis in this area of linguistic study. Generally speaking, follow-up study is highly desirable in the sciences, but so far it has been virtually impossible in the area of socio-...

Discrimination of Linguistic and Non-Linguistic Vocalizations in Spontaneous Speech: Intra- and Inter-Corpus Perspectives

We present a large-scale study on classification of linguistic and non-linguistic vocalizations including laughter, vocal noise, hesitation and consent on four corpora amounting to 46 h of spontaneous conversational speech. We consider training and testing on speaker-independent subsets of single corpora (intra-corpus) as well as inter-corpus experiments where models built on one or more corpora...

A Japanese National Project on Spontaneous Speech Corpus and Processing Technology

A new national project for raising the technological level of speech recognition and understanding has recently commenced in Japan. This project aims at a) building a large-scale spontaneous speech corpus consisting of roughly 7M words and 800 hours of speech, b) acoustic and linguistic modeling for spontaneous speech understanding and summarization using linguistic as well as para-linguistic i...

Why Is the Recognition of Spontaneous Speech so Hard?

Although read speech and similar types of speech, e.g. speech from reading newspapers or from news broadcasts, can be recognized with high accuracy, recognition accuracy decreases drastically for spontaneous speech. This is because spontaneous speech and read speech differ significantly, both acoustically and linguistically. This paper reports anal...

Benchmark Test for Speech Recognition Using the Corpus of Spontaneous Japanese

We present benchmark results of automatic speech recognition using the Corpus of Spontaneous Japanese (CSJ), which has been developed in a five-year national project and will be the largest spontaneous speech database. New test sets are designed for both academic presentation speech and extemporaneous public speech, which are the two major categories in the corpus. The test sets are selected ...

Publication date: 2003
